Open Information Extraction for the Web
Authors
Abstract
[Figure 4.2 diagram labels: 13,810,000 Tuples; Primary Entities; Relations; Filtering]
Figure 4.2: Open Extraction from Wikipedia: TextRunner extracts 32.5 million distinct assertions from 2.5 million Wikipedia articles. 6.1 million of these tuples represent concrete relationships between named entities. The ability to automatically detect synonymous facts about abstract entities remains an open problem.
Open Extraction from The General Web
What happens when we augment the size of TextRunner's input corpus by several orders of magnitude? In addition to processing Wikipedia, we added 500 million Web pages to the set of documents processed by TextRunner. This combination of Wikipedia and the Web is thus referred to as General-Web. After eliminating extractions found only in a single sentence, TextRunner was found to extract approximately 850 million raw tuples from General-Web, with 218 million tuples representing unique facts. Of these 218 million, 16.5 million tuples represent concrete facts; 14 million concrete facts remained after applying the aforementioned distributional ...
Footnote: The author wishes to thank Google Inc. for providing the corpus.
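The first filtering step mentioned above, discarding extractions found in only a single sentence, can be illustrated with a minimal sketch. This is not TextRunner's implementation: the `(arg1, relation, arg2)` string-tuple representation, the sentence identifiers, the function name `filter_by_support`, and the use of exact string matching to decide that two tuples are the same are all assumptions made for illustration.

```python
from collections import defaultdict


def filter_by_support(extractions, min_sentences=2):
    """Keep tuples observed in at least `min_sentences` distinct sentences.

    `extractions` is an iterable of (sentence_id, tuple) pairs, where each
    tuple is a hypothetical (arg1, relation, arg2) triple of strings.
    Returns a dict mapping each surviving tuple to its sentence count.
    """
    support = defaultdict(set)
    for sentence_id, tup in extractions:
        # A set of sentence ids ignores repeats within the same sentence.
        support[tup].add(sentence_id)
    return {
        tup: len(ids)
        for tup, ids in support.items()
        if len(ids) >= min_sentences
    }


# Toy input (assumed data, not taken from the corpus described above).
extractions = [
    (1, ("Edison", "invented", "the phonograph")),
    (2, ("Edison", "invented", "the phonograph")),
    (3, ("Edison", "was born in", "Ohio")),
    (3, ("Edison", "invented", "the phonograph")),
]

kept = filter_by_support(extractions)
print(kept)
```

In this toy input the tuple seen in three sentences survives, while the one seen in a single sentence is dropped. A real system would likely normalise arguments and relation strings before counting, which this sketch leaves out.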
Similar resources
Presenting a method for extracting structured domain-dependent information from Farsi Web pages
Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...
A New Method for Improving Computational Cost of Open Information Extraction Systems Using Log-Linear Model
Information extraction (IE) is the process of automatically producing a structured representation from unstructured or semi-structured text. It is a long-standing challenge in natural language processing (NLP) that has been intensified by the increasing volume, heterogeneity, and unstructured form of information. One of the core information extraction tasks is relation extraction wh...
Data Extraction using Content-Based Handles
In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...
LODIE: Linked Open Data for Web-scale Information Extraction
This work analyzes research gaps and challenges for Web-scale Information Extraction and foresees the usage of Linked Open Data as a groundbreaking solution for the field. The paper presents a novel methodology for Web scale Information Extraction which will be the core of the LODIE project (Linked Open Data Information Extraction). LODIE aims to develop Information Extraction techniques able t...
From hyperlinks to Semantic Web properties using Open Knowledge Extraction
Open information extraction approaches are useful but insufficient alone for populating the Web with machine readable information as their results are not directly linkable to, and immediately reusable from, other Linked Data sources. This work proposes a novel Open Knowledge Extraction approach that performs unsupervised, open domain, and abstractive knowledge extraction from text for producin...
EXTRACTION-BASED TEXT SUMMARIZATION USING FUZZY ANALYSIS
Due to the explosive growth of the world-wide web, automatic text summarization has become an essential tool for web users. In this paper we present a novel approach for creating text summaries. Using fuzzy logic and word-net, our model extracts the most relevant sentences from an original document. The approach utilizes fuzzy measures and inference on the extracted textual information from the docu...